Install the rmarkdown package
This is a latex equation, \(E = mc^{2}\)
This is a inline code print("this is an inline code")
Set echo = TRUE if you want to include source code in the output
For Spanish beginners see: https://markdown.es/sintaxis-markdown/
R Markdown Cheat Sheet: https://www.rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf
Other good one: https://guides.github.com/pdfs/markdown-cheatsheet-online.pdf
Latex symbols: https://en.wikibooks.org/wiki/LaTeX/Mathematics
Binary or binomial classification is the task of classifying the elements of a given set into two groups (predicting which group each one belongs to) (Wikipedia - Binary Classification (2017)).
An example:
In statistics, logistic regression, or logit regression, or logit model[1] is a regression model where the dependent variable (DV) is categorical (Wikipedia - Logistic Regression (2017)).
Find the maths here:
Given \(x\) and \(y\) find \(\hat{y}\) such as \(\hat{y} = P(y = 1 | x)\) where \(0 \leq \hat{y} \leq 1\)
We need to find \(w \in \mathbb{R}^n\) and \(b \in \mathbb{R}\) where \(\hat{y} = \sigma(w^Tx + b) = \sigma(z) \approx y\) and \(\sigma(z) = \frac{1}{1 + e^{-z}}\)
See the graph in R
Sigmoid <- function(x) {
1 / (1 + exp(-x))
}
# feed with data
x <- seq(-5, 5, 0.01)
# and plot
plot(x, Sigmoid(x), col='blue', ylim = c(-.2, 1))
abline(h = 0, v = 0, col = "gray60")
The idea is the following:
If z is very large \(\sigma(z) \approx 1\), but if z is a large negative number \(\sigma(z) \approx 0\)
We use the Sigmoid function because it’s a differentiable function (a piecewise linear function, given 0 value if x < 0.5 and 1 in another case, is not a differentiable one).
As \(\hat{y} \approx y\), we have a loss or error function:
The logistic regression cost function is the following:
\(\ell(y, \hat{y}) = -(ylog(\hat{y}) + (y +1)log(1 - \hat{y})\)
Why this loss function?
Because if \(y = 0\) we want \(\hat{y}\) small, if \(y = 1\) we want a large value for \(\hat{y}\). In another case, we want to penalize the result.
The cost function aggregates the loss function result for every \((x_{i}, y_{i})\). Given n, the sample size, the cost function is: \(\jmath(w, b) = \frac{1}{n}\sum_{1}^{n}\ell(y^{i} \hat{y}^{i}) = -\frac{1}{n}\sum_{1}^{n}((y^{i}log(\hat{y}^{i}) + (y^{i} +1)log(1 - \hat{y}^{i}))\)
Remember: \(\hat{y} = h_{\theta} = \sigma(w^Tx + b)\)
# Ref: https://www.r-bloggers.com/logistic-regression-with-r-step-by-step-implementation-part-2/
# Cost Function
#
CostFunction <- function(parameters, X, Y) {
n <- nrow(X)
# function to apply (%*% Matrix multiplication)
g <- Sigmoid(X %*% parameters)
J <- (1/n) * sum((-Y * log(g)) - ((1 - Y) * log(1 - g)))
return(J)
}
So, our mission is to find w, and b, such that minimizes \(\jmath(w, b)\)
To find maximums and minimums we use derivatives. And here the gradient descent appears (Raschka (2017)).
And, why derivatives?
Use the chain rule to obtain the derivatives.
The learning rate is paramount. If you set a very high learning rate the algorithm might not converge. On the contrary, with a very low learning rate the learning process could be slow (Magdon-Ismail (2017)).
Due to this kind of choices, experience is a high valueble assset (and Data Science become an Art in something aspects)
The algorithm is the following (represents the learning rate):
Repeat until convergence:
\(w = w - \alpha\frac{\partial\jmath(w, b)}{\partial w}\)
\(b = b - \alpha\frac{\partial\jmath(w, b)}{\partial b}\)
end algorithm
Then, remember we want to find W and b that minimizes the cost function not the \(\hat{y}\) function.
\[ \frac{\partial\jmath(w, b)}{\partial \theta_j} = \frac{1}{n} \sum\limits_{i=1}^n (h_{\theta}(x^{(i)}) - y^{(i)}) x_j^{(i)} \] (Fok (2012))
where \(h_{\theta} = \sigma\)
As \(\frac{1}{n}\) is a constant, \(\alpha\frac{1}{n}\) is constant too.
Other way to see this, is the following. Consider the algorithm as:
Repeat until convergence:
\(w = w - \alpha\frac{\partial\ell(y, \hat{y})}{\partial w}\)
\(b = b - \alpha\frac{\partial\ell(y, \hat{y})}{\partial b}\)
end algorithm
Then, the \(\frac{1}{n}\) constant dissapear.
NOTE (trick for the derivatives, using the chain rule):
Let \(a\) be the sigmoid functions, such that:
\(a = \sigma(z)\), \(-e^{-x} = 1 - (1 + e^{-x}) = 1 - \frac{1}{a} = \frac{a-1}{a}\)
An algorithm that always terminates with a desired solution in a finite number of iterations is a finite algorithm.
An iterative algorithm is said to converge when as the iterations proceed the output gets closer and closer to a specific value. Convergence to a local minimum can only be guaranteed when the function function is convex.
We can set up a stopping criterion, such as a finite number of iterations or a required improvement in the cost function.
Iterative algorithms are the core of Machine Learning, and for this reason, parallel computing is so important in Machine Learning and Data Science.
Suppose that you are an administrator of a university and you want to know the chance of admission of each applicant based on their two exams. You have historical data from previous applicants which can be used as the training data for logistic regression. Your task is to build a classification model that estimates each applicant’s probability of admission in university (Source:coursera machine learning class and Gondaliya (2013)).
Now, we have understood classification problem that we are going to address. Let us understand the data. In data we have records of previous applicants’ two exams score and label whether applicant got admission or not (1 - if got admission 0 - otherwise).
Analyze your data
#Load data
data <- read.csv("data/4_1_data.csv")
#Create plot
plot(data$score.1, data$score.2, col = as.factor(data$label), xlab = "Score-1", ylab = "Score-2")
In this case we have two predictor variables (two exams’ scores) and label as response variable. The equation is the following:
\(\hat{y} = \sigma(w_1x_1 + w_2*x_2 + b)\)
Then, remember we want to find W and b that minimizes the cost function not the \(\hat{y}\) function.
Let us set predictor and response variables.
#Predictor variables
X <- as.matrix(data[, c(1,2)])
#Add ones to X in the first column (matrix multiplication x b)
X <- cbind(rep(1, nrow(X)), X)
#Response variable
Y <- as.matrix(data$label)
#Intial parameters
initial_parameters <- rep(0, ncol(X))
#Cost at inital parameters
CostFunction(initial_parameters, X, Y)
## [1] 0.6931472
With each step of gradient descent, parameters come closer to the optimal values that will achieve the lowest cost. To do this we will set (i.e.) the iteration parameter to 1200
# We want to minimize the cost function. Then derivate this funcion
TestGradientDescent <- function(iterations = 1200, X, Y) {
# Initialize (b, W)
parameters <- rep(0, ncol(X))
# Check evolution
print(paste("Initial Cost Function value: ",
convergence <- c(CostFunction(parameters, X, Y)), sep = ""))
# updating (b, W) using gradient update
# Derive theta using gradient descent using optim function
# Look for information about the "optim" function (there are other options)
parameters_optimization <- optim(par = parameters, fn = CostFunction, X = X, Y = Y,
control = list(maxit = iterations))
#set parameters
parameters <- parameters_optimization$par
# Check evolution
print(paste("Final Cost Function value: ",
convergence <- c(CostFunction(parameters, X, Y)), sep = ""))
return(parameters)
}
# How to use
parameters <- TestGradientDescent(X = X, Y = Y)
## [1] "Initial Cost Function value: 0.693147180559945"
## [1] "Final Cost Function value: 0.203497704580909"
# probability of admission for student (1 = b, for the calculos)
new_student <- c(1,25,78)
print("Probability of admission for student:")
## [1] "Probability of admission for student:"
print(prob_new_student <- Sigmoid(t(new_student) %*% parameters))
## [,1]
## [1,] 0.01350584
Is he out? :-(
Unit tests are small functions that test your code and help you make sure everything is alright. To do this in R, the testthat package can be used as follows:
## Loading required package: testthat
Now, we can check the code for the TestGradientDescent function.
test_that("Test TestGradientDescent",{
parameters <- TestGradientDescent(X = X, Y = Y)
# probability of admission for student (1 = b, for the calculos)
new_student <- c(1,25,78)
prob_new_student <- Sigmoid(t(new_student) %*% parameters)
print(prob_new_student)
expect_equal(as.numeric(round(prob_new_student, digits = 4)), 0.0135)
# Fail, test
# expect_equal(as.numeric(round(prob_new_student, digits = 4)), 0.0130)
})
## [1] "Initial Cost Function value: 0.693147180559945"
## [1] "Final Cost Function value: 0.203497704580909"
## [,1]
## [1,] 0.01350584
Test the TestGradientDescent function with the training set (4_1_data.csv). Obtain the confusion matrix.
Obtain a graph representing how the cost function evolves depending of the number of iterations.
Explore other options using the optim function (see the methods section of the documentation). Explore other ways in R for estimating the Gradient Descent.
Explain why is not a trivial task to calculate the Gradient Descent. What happens if we have a very high dimensional problem?
Optional (+0.5 - 1 points in final grade).
Explain your work results using the RMarkdown format.
Some ideas from (recommended course) https://www.coursera.org/specializations/deep-learning
Fok, Sam. 2012. “Derivate Cost Function for Logistic Regression.” http://sambfok.blogspot.com.es/2012/08/partial-derivative-logistic-regression.html.
Gondaliya, Amar. 2013. “Logistic Regression with R.” https://www.r-bloggers.com/logistic-regression-with-r-step-by-step-implementation-part-2/.
Magdon-Ismail, M. 2017. “Logistic Regression and Gradient Descent.” http://www.cs.rpi.edu/~magdon/courses/LFD-Slides/SlidesLect09.pdf.
Raschka, Martin. 2017. “Machine Learning FAQ.” https://sebastianraschka.com/faq/docs/closed-form-vs-gd.html.
Wikipedia - Binary Classification. 2017. “Binary Classification.” https://en.wikipedia.org/wiki/Binary_classification.
Wikipedia - Logistic Regression. 2017. “Logistic Regression.” https://en.wikipedia.org/wiki/Logistic_regression.